-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make our ship_it.yml GHA workflow resilient #476
Merged
gerhard
merged 1 commit into
thechangelog:master
from
gerhard:make-our-ship_it-workflow-resilient
Jul 31, 2023
Merged
Make our ship_it.yml GHA workflow resilient #476
gerhard
merged 1 commit into
thechangelog:master
from
gerhard:make-our-ship_it-workflow-resilient
Jul 31, 2023
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
gerhard
force-pushed
the
make-our-ship_it-workflow-resilient
branch
from
July 30, 2023 15:06
921a242
to
5c13070
Compare
Key takeaway: using our own runners is ~7-8x quicker (regardless whether they run on Fly.io or K8s). Here is the job that ran on my forked repo: https://github.com/gerhard/changelog.com/actions/runs/5706915296
We may want to consider:
|
gerhard
force-pushed
the
make-our-ship_it-workflow-resilient
branch
4 times, most recently
from
July 31, 2023 07:08
cea42b8
to
eda1014
Compare
As you already know, we use Dagger for CI/CD. By default, this runs on Fly.io (via Docker). In some cases, this can fail. The last failure was when DNS resolution stopped working after the Docker instance was auto-upgraded from apps v1 -> v2 (a.k.a. Fly.io machines), e.g. https://github.com/thechangelog/changelog.com/actions/runs/5673476702/attempts/1 As a temporary fix, we had to delete some secrets and re-run the job. The job ran on GHA free runners & failed for genuine reasons 6 mins later: https://github.com/thechangelog/changelog.com/actions/runs/5673476702/job/15395264391 While running on the free GHA runners can be 3x-8x slower, it's a good fall-back. You heard us mention on multiple occasions: "always have redundancies in place". Since we already have multiple CI runtimes in place (Fly.io. K8s), let's make our GHA workflow resilient by: - Run on our preferred back-end by default (Dagger on Fly.io) - ✅ If it succeeds, we are done - ❌ If it fails, fallback to running on the free GitHub runners - In forks, use free GitHub runners by default (we cannot share `secrets`) While this means that a workflow which fails for genuine reasons will fail twice for us (1. Dagger on Fly.io, 2. Dagger on GitHub), it seems like a better place to improve from. This change goes one step further. We are using a third back-end: Dagger on K8s. This uses a self-hosted GitHub runner on K8s which is already integrated with Dagger. For now, we are using it just to see how the CI part compares to our primary setup (Dagger on Fly.io). We are not using Dagger on K8s to deploy the app. Let's see how this setup behaves over a few weeks/months before we consider taking it further. Part of this, we also improved on how we check for Fly.io connectivity. Things that could be improved in follow-ups: - the workflow should succeed if the `dagger-on-github-fallback` job succeeds - currently it fails if `dagger-on-fly-docker` fails - add `dagger-on-k8s` job as secondary fallback - GitHub Actions is currently missing actions/runner#1665 - maybe leverage a Dagger cache that works in forks too 😉 - Run Dagger Engine as a Fly Machine (no more Docker) - thechangelog#471 Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
gerhard
force-pushed
the
make-our-ship_it-workflow-resilient
branch
from
July 31, 2023 07:11
eda1014
to
2fcffd9
Compare
gerhard
added a commit
that referenced
this pull request
Jul 31, 2023
Related to #476 Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
TL;DR
As you already know, we use Dagger for CI/CD. By default, this runs on Fly.io (via Docker). In some cases, this can fail. The last failure was when DNS resolution stopped working after the Docker instance was auto-upgraded from apps v1 -> v2 (a.k.a. Fly.io machines): https://github.com/thechangelog/changelog.com/actions/runs/5673476702/attempts/1
As a temporary fix, we had to delete some secrets and re-run the job. The job ran on GHA free runners & failed for genuine reasons 6 mins later: https://github.com/thechangelog/changelog.com/actions/runs/5673476702/job/15395264391
While running on the free GHA runners can be 3x-8x slower, it's a good fall-back. You heard us mention on multiple occasions: "always have redundancies in place". Since we already have multiple CI runtimes in place (Fly.io, K8s), let's make our GHA workflow resilient by:
secrets
)While this means that a workflow which fails for genuine reasons will fail twice for us (1. Dagger on Fly.io, 2. Dagger on GitHub), it seems like a better place to improve from.
This change goes one step further. We are using a third back-end: Dagger on K8s. This uses a self-hosted GitHub runner on K8s which is already integrated with Dagger. For now, we are using it just to see how the CI part compares to our primary setup (Dagger on Fly.io). We are not using Dagger on K8s to deploy the app. Let's see how this setup behaves over a few weeks/months before we consider taking it further.
Part of this, we also improved on how we check for Fly.io connectivity.
Things that could be improved in follow-ups:
dagger-on-github-fallback
job succeedsdagger-on-fly-docker
failsdagger-on-k8s
job as secondary fallback